class: middle, inverse # Module 1 - First steps with R and RStudio .section-subtitle[ - R and RStudio - Interacting with R - Data types and data.frames - R packages - Data import ] --- class: middle # Introduction to R and RStudio --- # Overview of R <style type="text/css"> .r-console-img { margin: 0; position: absolute; top: 60%; right: 0; -ms-transform: translate(-25%, -50%); transform: translate(-25%, -50%); } </style> .pull-left[ - Versatile statistical computing software + language used for - data analysis - visualization - statistical modeling - data engineering - web applications - Many users from many fields - Biostatistics - Bioinformatics - Environmental Science - Social Sciences - Economics and Finance - Healthcare and Epidemiology - and more ... - R is open-source and community driven ] .pull-right[ <div class="figure"> <img src="data:image/png;base64,#images/r-terminal-screenshot.png" alt="The R console" width="100%" /> <p class="caption">The R console</p> </div> ] --- # Interacting with R The basic interaction mode in R is one of **expression evaluation** using a **command line interface (CLI)**. 
- The user types an expression (code) into the **R console (CLI)** - The system evaluates it - The system returns the result -- Example ``` r 10 + 0.5 * 2 + 1.2 * 3 ``` -- ``` #> [1] 14.6 ``` -- ``` r plot(sin(1:10), type = 'l') ``` -- <!-- --> --- # R is built, and extended, with **packages** Bundles of code, data, and documentation that do specific things and are loaded into R .footnote[ [1] What you first install without adding any other packages ] - The **base**<sup>1</sup> version of R contains many useful 'built-in' packages -- - We often install additional packages to **extend the functionality of R** -- .center.large[ External packages make R very popular and diverse in functionality ] <img src="data:image/png;base64,#images/to-infinity-and-beyond.jpeg" width="400px" style="display: block; margin: auto;" /> --- # Challenges for the new user <style type="text/css"> .challenges-computer-guy { position: absolute; bottom: 5%; right: 2%; width: 400px; } </style> .challenges-computer-guy[ <!-- --> ] - Writing **code** instead of using a Graphical User Interface (GUI) ``` r lm(response ~ predictor_1 + predictor_2, data = model_data) ``` -- - Interacting with the **computer file system** and with the **command line interface (CLI)** ``` r read.csv("C:/Users/Your_name/Documents/R-course/student-material/file.csv") ``` -- - **Using the keyboard more** - less mouse clicking, more keyboard shortcuts </br> </br> -- - **Data literacy** - datasets for computers and tidy data </br> </br> -- - Navigating **R packages** </br> </br> -- - **Project management** - scripts, data, outputs </br> </br> -- - Adopting **reproducible research** strategies </br> </br> --- # RStudio .pull-left[ - RStudio is an **Integrated Development Environment (IDE)**. - It **facilitates** the use of R. 
- It makes coding in R **more efficient and productive** - R scripts - syntax (code) highlighting - code completion - Provides **project management** allowing users to organise projects - And other, more **advanced** R tools ] .pull-right[ <div class="figure"> <img src="data:image/png;base64,#images/rstudio-screenshot.png" alt="RStudio" width="100%" /> <p class="caption">RStudio</p> </div> ] --- class: middle, center background-image: url(data:image/png;base64,#images/rstudio-screenshot-unlabelled.png) background-size: contain --- class: middle, center background-image: url(data:image/png;base64,#images/rstudio-screenshot-labelled.png) background-size: contain --- # Where does coding happen? <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#images/rstudio-in-action.gif" alt="Code to console" width="90%" /> <p class="caption">Code to console</p> </div> --- # Code organisation in "R scripts" We keep our code in files called **"R scripts"** (.R files) ... these are simply **text files** <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#images/an-r-script.png" alt="An R script" width="100%" /> <p class="caption">An R script</p> </div> --- # Code organisation in "R scripts" - writing comments We use **`#` (hash)** to create comments in our scripts, and they are not interpreted as code ``` r # Title: A script to do a particular task # Date created: Today # Details: These are the details of my work # Last modified: Tomorrow my_data <- read.csv('a-data-file.csv') # This will be used in future calculations days_in_6_months <- 365 / 12 * 6 # Can also be at the end of the line .. 
days_in_6_months <- 365 / 12 * 6 # This will be used in future calculations ``` .footnote[ RStudio and other IDEs will **colour text** based on its purpose ] --- # Project organisation .pull-left[ <img src="data:image/png;base64,#images/rstudio-project-layout.png" width="85%" style="display: block; margin: auto;" /> ] .pull-right[ In a **main folder - the project folder**, keep together - scripts - data - outputs ] --- # Project organisation .pull-left[ <img src="data:image/png;base64,#images/rstudio-project-layout.png" width="85%" style="display: block; margin: auto;" /> ] .pull-right[ In a **main folder - the project folder**, keep together - scripts - data - outputs Use an **RStudio project file (.Rproj)** - to start RStudio within the project directory - to allow RStudio to switch between different projects easily ] --- # Project organisation .pull-left[ <img src="data:image/png;base64,#images/rstudio-project-layout.png" width="85%" style="display: block; margin: auto;" /> ] .pull-right[ In a **main folder - the project folder**, keep together - scripts - data - outputs Use an **RStudio project file (.Rproj)** - to start RStudio within the project directory - to allow RStudio to switch between different projects easily For collaboration / version control **share the project folder** - ensures work is reproducible - git/GitHub ] --- # Project organisation - RStudio projects An `.Rproj` file will open RStudio with the "working directory" set to the project folder - avoids file path problems: files and outputs are referenced relative to the project folder .pull-left[ <img src="data:image/png;base64,#images/rstudio-project-file.png" style="display: block; margin: auto;" /> ] -- .pull-right[ <img src="data:image/png;base64,#images/rstudio-project-screenshot.png" style="display: block; margin: auto;" /> ] --- # Start / convert projects into an RStudio project Use RStudio to **create a project file** for a new, or existing, project directory ... <br/> E.g. 
`File > New Project > follow the prompts shown` <img src="data:image/png;base64,#images/rstudio-create-project.png" style="display: block; margin: auto;" /> --- # R sessions and reproducible code When we start R and RStudio, we are provided with a **temporary workspace - an R session** where objects and data reside .pull-left[ - data and objects created **only exist for this session** - they are lost when the session ends ] .pull-right[ <style type="text/css"> .r-session-img { position: absolute; bottom: 5%; right: 2%; width: 400px; } </style> <img src="data:image/png;base64,#images/restart-workspace-2.png" width="100%" style="display: block; margin: auto;" /> ] -- Your goal is to create **well-structured**, **self-contained code** to **ensure reproducibility** .pull-left[ - Use **R scripts** to create workflows - They work from start to finish in a new R session ] .pull-right[ <img src="data:image/png;base64,#images/rscript-reproducible.png" width="100%" style="display: block; margin: auto;" /> ] --- class: middle # Before we do our first exercise ... --- # How we will do exercises - **student-material** folder In the course folder is a file `r-for-scientific-research.Rproj` and the folder `student-material/`. The student material folder contains a series of scripts e.g. 
`M01-01-xxx` which are the exercises we will work through throughout the course -- .pull-left-60[ - Open RStudio using `r-for-scientific-research.Rproj` - With the course project open - Use the **RStudio files browser** to open files in `student-material/` - Questions have placeholders for responses ``` r # Store the result into an object called `evens` _ <- c(2,4,6,8,10) ``` ``` r # Store the result into an object called `evens` evens <- c(2,4,6,8,10) ``` - We 'clear the workspace' for each new script - Ctrl / Cmd + shift + F10 - Noted at the top of each exercise to remind you ] .pull-right-40[ <img src="data:image/png;base64,#images/rstudio-exercises.png" width="90%" style="display: block; margin: auto;" /> ] --- # Tips and tricks to optimise your interaction with R </br> - **Use a mouse** - much easier to navigate, select and highlight text compared with a trackpad </br> - **Use keyboard shortcuts** - limit clicking - Operating system - cut and paste - `ctrl/cmd + c`, `ctrl/cmd + v` - text selection - `shift + home/end` - RStudio system - run line of code - `ctrl/cmd + enter` - use the `Tab` button for code and file path completion </br> - **Know how important symbols are accessed on the keyboard** - `$`, `(`, `)`, `[`, `]`, `{`, `}`, `#`, `_`, `~` `<` `>` `=` `!` `,` `-`, etc .. --- class: middle, center, inverse # Let's get started with R and RStudio! 
.section-subtitle[ Within RStudio, open the file .inverse[`M01-01-using-RStudio.R`] and attempt the exercises ] --- # Section recap - Examined the layout and use of RStudio - Where R lives (the console) - Entered code into the R console, from a script and from parts of code - Stored values and retrieved values in objects, observed them in the environment - Observed outputs (print and plot) - R sessions - The working directory - Used the file browser - Observed some help documentation - File and Tools menu of RStudio for new item and RStudio configuration --- class: middle # First steps in R --- # R is a calculator We type expressions, R evaluates<sup>1</sup> .footnote[ [1] `Ctrl/Cmd + Enter` sends code from script editor to the R console in RStudio ] -- ``` r 1 + 1 ``` ``` #> [1] 2 ``` </br> -- ``` r 4 * 2 ``` ``` #> [1] 8 ``` </br> -- ``` r # Has math functions sin(pi / 4) ``` ``` #> [1] 0.7071068 ``` </br> --- # Storing values and results We **assign** values into **objects** using `<-` <sup>1</sup> (or `=`) ``` r result1 <- 1 + 1 # recommended to use <- result2 = sin(pi / 4) ``` .footnote[ [1] **Keyboard shortcut**: `Alt` + `-` ] -- Note: generally **nothing** happens if we correctly store the result! -- </br> To access the value, we use the object name ``` r result1 ``` ``` #> [1] 2 ``` -- ``` r result2 ``` ``` #> [1] 0.7071068 ``` --- # Naming objects There are **rules**: must be built from **letters**, **numbers**, **underscore** and **period**, but **cannot start** with a number or an underscore. ``` r x <- 1 x_1 <- 1 x.1 <- 1 1x <- 1 # error ``` .output-red[ ``` #> Error: unexpected symbol in '1x' ``` ] -- Use **consistent, meaningful, descriptive** object names ``` r # Storing the value of the number of days in 6 months m6 <- 365 / 12 * 6 days_in_6_months <- 365 / 12 * 6 ``` -- There are names for the styles you can use ``` r i_use_snake_case # snake_case otherPeopleUseCamelCase # camelCase some.people.use.periods # periods And_aFew.People_RENOUNCEconvention # ??? 
``` -- Do not be afraid of long names! RStudio has `<tab>` completion, which helps you type quickly & correctly! --- # Object names are case-sensitive ``` r days_in_6_months <- 365 / 12 * 6 ``` <br> -- Must match the name exactly ... .output-red[ ``` r Days_in_6_months ``` ``` #> Error: object 'Days_in_6_months' not found ``` ] </br> -- Like this .. ``` r days_in_6_months ``` ``` #> [1] 182.5 ``` -- .footnote[ Again, using the `<tab>` key in RStudio will help you type and be correct ] --- # Doing calculations on sets of numbers - vectors .footnote[ You will use `c()` a lot with R ... remember it! ] The function `c()` combines values in a **vector** ``` r weights_kg <- c(61, 72, 52, 90, 93, 71) weights_kg ``` ``` #> [1] 61 72 52 90 93 71 ``` </br> -- after which we can use it in calculations ``` r # convert values to grams weights_g <- weights_kg * 1000 weights_g ``` ``` #> [1] 61000 72000 52000 90000 93000 71000 ``` --- # Functions help us do calculations Many things in R are done using **functions** ... e.g. statistical functions ``` r weights_mean <- mean(weights_kg) weights_mean ``` ``` #> [1] 73.16667 ``` -- ``` r weights_mean <- sum(weights_kg) / length(weights_kg) weights_mean ``` ``` #> [1] 73.16667 ``` </br> -- **How do they work?** What is this doing? ``` r seq(0, 10, 2) ``` ``` #> [1] 0 2 4 6 8 10 ``` --- # Functions have **arguments**, and arguments are given **values** ``` r function_name(argument1 = value1, argument2 = value2, ...) ``` <br> -- The `seq()` function ``` # Help documentation Generate regular sequences ... seq(from, to, by, length.out, along.with, ...) ``` </br> -- Generate a sequence **from=** 0 **to=** 10 **by=** 2 ``` r seq(0, 10, 2) ``` ``` #> [1] 0 2 4 6 8 10 ``` --- # Using functions and arguments - Arguments can be named, and **this is a good practice** .. 
code is **readable** ``` r seq(from = 0, to = 10, by = 2) ``` ``` #> [1] 0 2 4 6 8 10 ``` <br> -- - Unnamed arguments need to be in the order of the function definition ("positional") - without names the code is **often ambiguous** ... **use named arguments!** - a *majority of R code from others looks like this* **:o(** ``` r seq(0, 10, 2) ``` ``` #> [1] 0 2 4 6 8 10 ``` <br> -- - Named arguments can be in any order ``` r seq(by = 2, to = 10, from = 0) ``` ``` #> [1] 0 2 4 6 8 10 ``` --- # Some functions have **default values** for arguments Why does this work if we only use `to`? ``` r seq(to = 4) ``` ``` #> [1] 1 2 3 4 ``` <br> -- The help documentation reveals these **default values** ``` # Help documentation seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)), ...) ``` <br> -- .pull-left[ Thus ``` r seq(to = 4) ``` ] -- .pull-right[ is actually ``` r seq(from = 1, to = 4, by = 1) ``` ] --- # What arguments and information are available for a function? Executing the function name with `?` before it opens **help documentation** ``` r ?seq ``` -- .pull-left[ <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#images/help-documentation.png" alt="RStudio help viewer panel" width="100%" /> <p class="caption">RStudio help viewer panel</p> </div> ] -- .pull-right[ - R help documentation is standardised - Title - Description - Usage - Arguments - Further details - Examples ] --- # What arguments and information are available for a function? Do not be afraid to ask Google! 
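R's built-in tools can also answer many of these questions before you search online. A minimal sketch using only base R functions (`args()`, `formals()`, `example()`); the object name `seq_defaults` is just an illustrative choice:

``` r
# Show the arguments of a function and their default values
args(seq.default)

# Extract the default values so they can be inspected individually
seq_defaults <- formals(seq.default)
names(seq_defaults)   # the argument names
seq_defaults$from     # the default value of `from`

# Run the examples from the bottom of the help page
example(seq)

# To search all installed help pages by keyword, type e.g. ??sequence
```

These complement `?seq`: `args()` is a quick reminder of argument names, and `example()` shows the function in action.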
<img src="data:image/png;base64,#images/google-help.png" width="700px" style="display: block; margin: auto;" /> --- # Functions live in packages<sup>1</sup> **Packages** are bundles of code, data, documentation that do specific things - Example of functions in the base R `stats` package ``` r ls("package:stats") ``` .scrollable-slide-300[ ``` #> [1] "acf" "acf2AR" "add.scope" "add1" "addmargins" "aggregate" #> [7] "aggregate.data.frame" "aggregate.ts" "AIC" "alias" "anova" "ansari.test" #> [13] "aov" "approx" "approxfun" "ar" "ar.burg" "ar.mle" #> [19] "ar.ols" "ar.yw" "arima" "arima.sim" "arima0" "arima0.diag" #> [25] "ARMAacf" "ARMAtoMA" "as.dendrogram" "as.dist" "as.formula" "as.hclust" #> [31] "as.stepfun" "as.ts" "asOneSidedFormula" "ave" "bandwidth.kernel" "bartlett.test" #> [37] "BIC" "binom.test" "binomial" "biplot" "Box.test" "bw.bcv" #> [43] "bw.nrd" "bw.nrd0" "bw.SJ" "bw.ucv" "C" "cancor" #> [49] "case.names" "ccf" "chisq.test" "cmdscale" "coef" "coefficients" #> [55] "complete.cases" "confint" "confint.default" "confint.lm" "constrOptim" "contr.helmert" #> [61] "contr.poly" "contr.SAS" "contr.sum" "contr.treatment" "contrasts" "contrasts<-" #> [67] "convolve" "cooks.distance" "cophenetic" "cor" "cor.test" "cov" #> [73] "cov.wt" "cov2cor" "covratio" "cpgram" "cutree" "cycle" #> [79] "D" "dbeta" "dbinom" "dcauchy" "dchisq" "decompose" #> [85] "delete.response" "deltat" "dendrapply" "density" "density.default" "deriv" #> [91] "deriv3" "deviance" "dexp" "df" "df.kernel" "df.residual" #> [97] "DF2formula" "dfbeta" "dfbetas" "dffits" "dgamma" "dgeom" #> [103] "dhyper" "diffinv" "dist" "dlnorm" "dlogis" "dmultinom" #> [109] "dnbinom" "dnorm" "dpois" "drop.scope" "drop.terms" "drop1" #> [115] "dsignrank" "dt" "dummy.coef" "dummy.coef.lm" "dunif" "dweibull" #> [121] "dwilcox" "ecdf" "eff.aovlist" "effects" "embed" "end" #> [127] "estVar" "expand.model.frame" "extractAIC" "factanal" "factor.scope" "family" #> [133] "fft" "filter" "fisher.test" "fitted" 
"fitted.values" "fivenum" #> [139] "fligner.test" "formula" "frequency" "friedman.test" "ftable" "Gamma" #> [145] "gaussian" "get_all_vars" "getCall" "getInitial" "glm" "glm.control" #> [151] "glm.fit" "hasTsp" "hat" "hatvalues" "hclust" "heatmap" #> [157] "HoltWinters" "influence" "influence.measures" "integrate" "interaction.plot" "inverse.gaussian" #> [163] "IQR" "is.empty.model" "is.leaf" "is.mts" "is.stepfun" "is.ts" #> [169] "is.tskernel" "isoreg" "KalmanForecast" "KalmanLike" "KalmanRun" "KalmanSmooth" #> [175] "kernapply" "kernel" "kmeans" "knots" "kruskal.test" "ks.test" #> [181] "ksmooth" "lag" "lag.plot" "line" "lm" "lm.fit" #> [187] "lm.influence" "lm.wfit" "loadings" "loess" "loess.control" "loess.smooth" #> [193] "logLik" "loglin" "lowess" "ls.diag" "ls.print" "lsfit" #> [199] "mad" "mahalanobis" "make.link" "makeARIMA" "makepredictcall" "manova" #> [205] "mantelhaen.test" "mauchly.test" "mcnemar.test" "median" "median.default" "medpolish" #> [211] "model.extract" "model.frame" "model.frame.default" "model.matrix" "model.matrix.default" "model.matrix.lm" #> [217] "model.offset" "model.response" "model.tables" "model.weights" "monthplot" "mood.test" #> [223] "mvfft" "na.action" "na.contiguous" "na.exclude" "na.fail" "na.omit" #> [229] "na.pass" "napredict" "naprint" "naresid" "nextn" "nlm" #> [235] "nlminb" "nls" "nls.control" "NLSstAsymptotic" "NLSstClosestX" "NLSstLfAsymptote" #> [241] "NLSstRtAsymptote" "nobs" "numericDeriv" "offset" "oneway.test" "optim" #> [247] "optimHess" "optimise" "optimize" "order.dendrogram" "p.adjust" "p.adjust.methods" #> [253] "pacf" "Pair" "pairwise.prop.test" "pairwise.t.test" "pairwise.table" "pairwise.wilcox.test" #> [259] "pbeta" "pbinom" "pbirthday" "pcauchy" "pchisq" "pexp" #> [265] "pf" "pgamma" "pgeom" "phyper" "plclust" "plnorm" #> [271] "plogis" "plot.ecdf" "plot.spec.coherency" "plot.spec.phase" "plot.stepfun" "plot.ts" #> [277] "pnbinom" "pnorm" "poisson" "poisson.test" "poly" "polym" #> [283] "power" 
"power.anova.test" "power.prop.test" "power.t.test" "PP.test" "ppoints" #> [289] "ppois" "ppr" "prcomp" "predict" "predict.glm" "predict.lm" #> [295] "preplot" "princomp" "printCoefmat" "profile" "proj" "promax" #> [301] "prop.test" "prop.trend.test" "psignrank" "psmirnov" "pt" "ptukey" #> [307] "punif" "pweibull" "pwilcox" "qbeta" "qbinom" "qbirthday" #> [313] "qcauchy" "qchisq" "qexp" "qf" "qgamma" "qgeom" #> [319] "qhyper" "qlnorm" "qlogis" "qnbinom" "qnorm" "qpois" #> [325] "qqline" "qqnorm" "qqplot" "qr.influence" "qsignrank" "qsmirnov" #> [331] "qt" "qtukey" "quade.test" "quantile" "quasi" "quasibinomial" #> [337] "quasipoisson" "qunif" "qweibull" "qwilcox" "r2dtable" "rbeta" #> [343] "rbinom" "rcauchy" "rchisq" "read.ftable" "rect.hclust" "reformulate" #> [349] "relevel" "reorder" "replications" "reshape" "resid" "residuals" #> [355] "residuals.glm" "residuals.lm" "rexp" "rf" "rgamma" "rgeom" #> [361] "rhyper" "rlnorm" "rlogis" "rmultinom" "rnbinom" "rnorm" #> [367] "rpois" "rsignrank" "rsmirnov" "rstandard" "rstudent" "rt" #> [373] "runif" "runmed" "rweibull" "rwilcox" "rWishart" "scatter.smooth" #> [379] "screeplot" "sd" "se.contrast" "selfStart" "setNames" "shapiro.test" #> [385] "sigma" "simulate" "smooth" "smooth.spline" "smoothEnds" "sortedXyData" #> [391] "spec.ar" "spec.pgram" "spec.taper" "spectrum" "spline" "splinefun" #> [397] "splinefunH" "SSasymp" "SSasympOff" "SSasympOrig" "SSbiexp" "SSD" #> [403] "SSfol" "SSfpl" "SSgompertz" "SSlogis" "SSmicmen" "SSweibull" #> [409] "start" "stat.anova" "step" "stepfun" "stl" "StructTS" #> [415] "summary.aov" "summary.glm" "summary.lm" "summary.manova" "summary.stepfun" "supsmu" #> [421] "symnum" "t.test" "termplot" "terms" "terms.formula" "time" #> [427] "toeplitz" "toeplitz2" "ts" "ts.intersect" "ts.plot" "ts.union" #> [433] "tsdiag" "tsp" "tsp<-" "tsSmooth" "TukeyHSD" "uniroot" #> [439] "update" "update.default" "update.formula" "var" "var.test" "variable.names" #> [445] "varimax" "vcov" "weighted.mean" 
"weighted.residuals" "weights" "wilcox.test" #> [451] "window" "window<-" "write.ftable" "xtabs" ``` ] .footnote[ [1] You can make functions outside of packages too ] --- # Errors and warnings in R When **something fails**, you will receive a message in .red[red] ... Be *calm*, and **read the message** ... -- An **error** means something has failed, and **no result produced** - An error by the user e.g. **incorrect code** - An error from the computer e.g. **computer has used all memory** .output-red[ ``` r seq(from = NA, to = 5, by = 2) ``` ``` #> Error in seq.default(from = NA, to = 5, by = 2): 'from' must be a finite number ``` ] <br> -- A **warning** means something is not correct, but a **result still produced**. - Sometimes this is just useful information - Other times you will need to address the warning .output-red[.code-3-black[ ``` r seq(from = 0, TO = 6, by = 2) ``` ``` #> Warning: In seq.default(from = 0, TO = 6, by = 2) : #> extra argument 'TO' will be disregarded ``` ``` #> [1] 0 ``` ] ] --- class: center, middle, inverse # Let's do some exercises! .section-subtitle[ Within RStudio, open the file .inverse[`M01-02-coding-basics.R`] and attempt the exercises. ] --- # Section recap - Did math calculations - Did calculation with sets of numbers in objects - Stored results in well-named objects - Played with some fun functions and their arguments - Examined help files - Observed error and warning messages - Use <tab> completion --- class: middle # Data types, vectors and data.frames --- # The data.frame <style type="text/css"> .dataframe-important { position: absolute; top: 2%; right: 2%; width: 180px; } </style> .dataframe-important[ <!-- --> ] We typically use a data structure for data analysis called a **data.frame**. A data.frame is a **rectangular collection of different data vectors of the same length** ... a "spreadsheet of data" - **columns** are data vectors ... variables - **rows** are related values across columns ... 
cases, observations .pull-left-60[ ``` r example_dataframe ``` ``` #> ID entry_date age sex weight test_complete #> 1 P1 2024-01-02 30 Male 66.2 FALSE #> 2 P2 2024-01-02 41 Female 89.9 TRUE #> 3 P3 2024-01-01 48 Female 37.4 TRUE #> 4 P4 2024-01-02 42 Female 76.8 TRUE #> 5 P5 2024-01-02 24 Male 60.3 FALSE #> 6 P6 2024-01-01 49 Female 65.1 FALSE #> 7 P7 2024-01-01 50 Male 74.9 TRUE #> 8 P8 2024-01-03 28 Male 64.1 TRUE ``` ] -- .pull-right-40[ The column data types ``` #> type #> ID "character" #> entry_date "Date" #> age "integer" #> sex "character" #> weight "numeric" #> test_complete "logical" ``` ] <br> -- We must understand **data types and vectors** in R before continuing ... --- # Data types R has: **numbers**, **text**, **logical** (TRUE/FALSE) and **dates/times** .pull-left[ **What you see** ``` r 5 ``` ``` #> [1] 5 ``` ] .pull-right[ **What it is to R** ``` #> [1] "numeric" "double" "integer" ``` ] -- .pull-left[ ``` r "Tom" # or 'Tom' .. " or ' ``` ``` #> [1] "Tom" ``` ] .pull-right[ ``` #> [1] "character" ``` ] -- .pull-left[ ``` r TRUE # or FALSE ``` ``` #> [1] TRUE ``` ] .pull-right[ ``` #> [1] "logical" ``` ] -- .pull-left[ ``` r Sys.Date() ``` ``` #> [1] "2025-05-19" ``` ] .pull-right[ ``` #> [1] "Date" "double" ``` ] --- # Data types - missing values R uses `NA` for a missing value ("Not Available") and `NaN` for "Not a Number" .pull-left[ ``` r NA ``` ``` #> [1] NA ``` ] .pull-right[ ``` r NaN ``` ``` #> [1] NaN ``` ] -- .pull-left[] .pull-right[ ``` r 0 / 0 ``` ``` #> [1] NaN ``` ] .pull-right[ <br> ] <br> -- These can have **special** treatment in functions ... we must be mindful of them always! .pull-left[ ``` r mean(c(1,2,NA,3)) ``` ``` #> [1] NA ``` ] -- .pull-right[ ``` r mean(c(1,2,NA,3), na.rm = TRUE) ``` ``` #> [1] 2 ``` ] --- # Data live in vectors We store and work with data in **vectors** ... 
and later have them as **columns** in data frames ``` r # The function `c()` combines values in a vector a_number_vector <- c(5, 7, 2, 5, 6, 4) a_number_vector ``` ``` #> [1] 5 7 2 5 6 4 ``` -- ``` r a_text_vector <- c("Low", "Mid", "High", "Low", "Mid", 'High') a_text_vector ``` ``` #> [1] "Low" "Mid" "High" "Low" "Mid" "High" ``` </br> -- Example using two vectors together: the mean of the `a_number_vector` values by groups in `a_text_vector` ``` r tapply(X = a_number_vector, INDEX = a_text_vector, FUN = mean) ``` ``` #> High Low Mid #> 3.0 5.0 6.5 ``` --- # Only **one** data type can live in a vector R automatically changes type ("coercion") when different ones exist in a vector - **Important** when we import "messy" data .. leads to unexpected data types .pull-left[ ``` r # now all TEXT c(1, 'Tom', TRUE) ``` ``` #> [1] "1" "Tom" "TRUE" ``` ] -- .pull-right[ ``` r # now all NUMBERS c(1, 2, TRUE, FALSE) ``` ``` #> [1] 1 2 1 0 ``` ] -- .pull-left[ ``` r # now all NUMBERS c(1, Sys.Date()) ``` ``` #> [1] 1 20227 ``` ] -- .pull-right[ ``` r # now all TEXT c('Tom', Sys.Date()) ``` ``` #> [1] "Tom" "20227" ``` ] -- .pull-left[ ``` r # NA stays NA c(1, NA, 2, '3') ``` ``` #> [1] "1" NA "2" "3" ``` ] --- # Values can have an identifier - a name Vector values can have **names** - Some functions will take named vectors for input e.g. 
for applying colours to groups in a plot ``` r named_vector <- c(Low = 1, Mid = 2, High = 3) named_vector ``` ``` #> Low Mid High #> 1 2 3 ``` --- # Factor type / vectors - categorising values into groups **Factors** are special vectors for categorising data values, and are used extensively in statistical and plot routines - Contain **levels** - information about what groups/categories are present ``` r treatment_factor <- factor(c("Low", "Mid", "High", "Low", "Mid", "High")) treatment_factor ``` ``` #> [1] Low Mid High Low Mid High #> Levels: High Low Mid ``` <br> .footnote[ We will explore these later in the plotting and statistics modules ] -- **Numbers to factors** - The values now represent the groups "1", "3" and "5" and **not** the numbers ... "factor levels" ``` r numbers_factor <- factor(c(1, 3, 5, 1, 3, 5)) numbers_factor ``` ``` #> [1] 1 3 5 1 3 5 #> Levels: 1 3 5 ``` --- # Vectors can live in different **data structures** **A vector** ``` #> [1] 1 2 3 4 5 6 7 8 9 ``` .pull-left[ **Matrices** ``` #> [,1] [,2] [,3] #> [1,] 1 4 7 #> [2,] 2 5 8 #> [3,] 3 6 9 ``` **Data frames** ``` #> a_vector text logical #> 1 2 Jerry FALSE #> 2 10 Jerry FALSE #> 3 6 Jerry FALSE #> 4 1 Tom FALSE #> 5 9 Quacker FALSE #> 6 3 Tom TRUE #> 7 4 Tom TRUE #> 8 8 Tom FALSE ``` ] .pull-right[ **Lists** ``` #> $a_vector #> [1] 6 10 5 3 8 4 1 9 #> #> $group #> [1] "person_list" #> #> $n_persons #> [1] 3 #> #> $persons #> names ages #> 1 Tom 21 #> 2 Jerry 25 #> 3 Quacker 29 ``` ] -- **We will focus on data frames** --- # Visualisation of vectors in different data structures <img src="data:image/png;base64,#images/r-data-structures.png" width="95%" /> --- # data.frames are collections of vectors We should always check the structure and vectors of a data.frame with `str()` before using it ``` r example_dataframe ``` ``` #> ID entry_date age sex weight test_complete #> 1 P1 2024-01-02 30 Male 66.2 FALSE #> 2 P2 2024-01-02 41 Female 89.9 TRUE #> 3 P3 2024-01-01 48 Female 37.4 TRUE #> 4 P4 
2024-01-02 42 Female 76.8 TRUE #> 5 P5 2024-01-02 24 Male 60.3 FALSE #> 6 P6 2024-01-01 49 Female 65.1 FALSE #> 7 P7 2024-01-01 50 Male 74.9 TRUE #> 8 P8 2024-01-03 28 Male 64.1 TRUE ``` -- ``` r str(example_dataframe) ``` ``` #> 'data.frame': 8 obs. of 6 variables: #> $ ID : chr "P1" "P2" "P3" "P4" ... #> $ entry_date : Date, format: "2024-01-02" "2024-01-02" "2024-01-01" "2024-01-02" ... #> $ age : int 30 41 48 42 24 49 50 28 #> $ sex : chr "Male" "Female" "Female" "Female" ... #> $ weight : num 66.2 89.9 37.4 76.8 60.3 65.1 74.9 64.1 #> $ test_complete: logi FALSE TRUE TRUE TRUE FALSE FALSE ... ``` --- # Importing data typically results in a data.frame ``` r penguins <- read.csv(file = 'data/penguins.csv') ``` -- ``` r class(penguins) ``` ``` #> [1] "data.frame" ``` -- ``` r # View the first 6 rows head(penguins) ``` ``` #> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year #> 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007 #> 2 Adelie Torgersen 39.5 17.4 186 3800 female 2007 #> 3 Adelie Torgersen 40.3 18.0 195 3250 female 2007 #> 4 Adelie Torgersen NA NA NA NA <NA> 2007 #> 5 Adelie Torgersen 36.7 19.3 193 3450 female 2007 #> 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007 ``` --- class: middle # Using columns within data.frames --- # Accessing column data .footnote[ `$` for interactive analysis ... `[[]]` for programming and in functions ] .pull-left-40[ We use - `$` ... the dollar sign - `[[]]` ... 
double square brackets which returns a vector of data ] -- .pull-right-60[ ``` r example_dataframe ``` ``` #> ID entry_date age sex weight test_complete #> 1 P1 2024-01-02 30 Male 66.2 FALSE #> 2 P2 2024-01-02 41 Female 89.9 TRUE #> 3 P3 2024-01-01 48 Female 37.4 TRUE #> 4 P4 2024-01-02 42 Female 76.8 TRUE #> 5 P5 2024-01-02 24 Male 60.3 FALSE #> 6 P6 2024-01-01 49 Female 65.1 FALSE #> 7 P7 2024-01-01 50 Male 74.9 TRUE #> 8 P8 2024-01-03 28 Male 64.1 TRUE ``` ] -- ``` r example_dataframe$sex ``` ``` #> [1] "Male" "Female" "Female" "Female" "Male" "Female" "Male" "Male" ``` -- ``` r # Column name as a "string" example_dataframe[['sex']] ``` ``` #> [1] "Male" "Female" "Female" "Female" "Male" "Female" "Male" "Male" ``` --- # Using columns in functions .footnote[ `$` for interactive analysis ... `[[]]` for programming and in functions ] .pull-left-40[ Use form `function(dataframe$column)` ] .pull-right-60[ ``` r example_dataframe ``` ``` #> ID entry_date age sex weight test_complete #> 1 P1 2024-01-02 30 Male 66.2 FALSE #> 2 P2 2024-01-02 41 Female 89.9 TRUE #> 3 P3 2024-01-01 48 Female 37.4 TRUE #> 4 P4 2024-01-02 42 Female 76.8 TRUE #> 5 P5 2024-01-02 24 Male 60.3 FALSE #> 6 P6 2024-01-01 49 Female 65.1 FALSE #> 7 P7 2024-01-01 50 Male 74.9 TRUE #> 8 P8 2024-01-03 28 Male 64.1 TRUE ``` ] -- ``` r mean(example_dataframe$weight) ``` ``` #> [1] 66.8375 ``` -- ``` r table(example_dataframe[['sex']]) ``` ``` #> #> Female Male #> 4 4 ``` --- # Using columns in functions Plot the `age` and `weight` columns ``` r plot(x = example_dataframe$age, y = example_dataframe$weight) ``` <br> <img src="data:image/png;base64,#C:/Users/shaun/Documents/r-for-scientific-research/lecture-notes/01-Module-1-R-basics-and-data-import_files/figure-html/unnamed-chunk-176-1.png" style="display: block; margin: auto;" /> --- # Using columns in functions Sometimes a function takes in a `data=` argument and we do not need `dataframe$...` -- Conduct a t-test comparing `weight` between levels of `sex` 
``` r t.test(weight ~ sex, data = example_dataframe) #t.test(example_dataframe$weight ~ example_dataframe$sex) # also valid ``` ``` #> #> Welch Two Sample t-test #> #> data: weight by sex #> t = 0.079743, df = 3.4565, p-value = 0.9408 #> alternative hypothesis: true difference in means between group Female and group Male is not equal to 0 #> 95 percent confidence interval: #> -33.37989 35.22989 #> sample estimates: #> mean in group Female mean in group Male #> 67.300 66.375 ``` .footnote[ The tilde symbol `~` is part of a **formula** and means "weight explained by sex" ] --- class: center, middle, inverse # Exercise time! .section-subtitle[ Open the file .inverse[`M01-03-data-types-and-data.frames.R`] and attempt the exercises. ] --- # Section recap - examined some data vector properties - examined some built-in data.frames in R - obtained information about a data.frame: structure, data types, columns and rows - conducted some calculations on data.frame columns - plotted data from a data.frame --- class: middle # Using our newly acquired knowledge --- # Let's conduct a simple analysis in R We will follow a simple linear regression analysis to see what R can do - Obtain a dataset - Examine the data - Obtain some data summaries - Plot the data - Fit a linear regression model - Obtain the model summary - Plot the data with the regression line -- The **goal for you** is to observe how - code is written - objects are created - you inspect data - data columns are selected - statistical output is presented ... **do not worry if this is too complex .. just observe!** --- # The research question Is there a relationship between **petal length** and **petal width** of Iris species? 
--- # The research data `iris` is a dataset within R ``` r # Show the first and last rows of the data head(iris) ``` ``` #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.2 setosa #> 2 4.9 3.0 1.4 0.2 setosa #> 3 4.7 3.2 1.3 0.2 setosa #> 4 4.6 3.1 1.5 0.2 setosa #> 5 5.0 3.6 1.4 0.2 setosa #> 6 5.4 3.9 1.7 0.4 setosa ``` ``` r tail(iris) ``` ``` #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 145 6.7 3.3 5.7 2.5 virginica #> 146 6.7 3.0 5.2 2.3 virginica #> 147 6.3 2.5 5.0 1.9 virginica #> 148 6.5 3.0 5.2 2.0 virginica #> 149 6.2 3.4 5.4 2.3 virginica #> 150 5.9 3.0 5.1 1.8 virginica ``` --- # A summary of the data `summary()` will provide a numerical or group summary depending on what is in the column For `iris` - Numerical summaries of sepal and petal measurements - Counts of species from which measurements were obtained ``` r summary(iris) ``` ``` #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50 #> 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50 #> Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50 #> Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199 #> 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800 #> Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500 ``` --- # Visualising the data `plot()` can provide us with a visualisation of the data. There are arguments for labels and other plot components. 
.pull-left[ ``` r plot(x = iris$Petal.Width, y = iris$Petal.Length, main = "Petal width vs petal length", xlab = "Petal width (cm)", ylab = "Petal length (cm)", pch = 19) ``` ] -- .pull-right[ <img src="data:image/png;base64,#C:/Users/shaun/Documents/r-for-scientific-research/lecture-notes/01-Module-1-R-basics-and-data-import_files/figure-html/petal-width-length-plot-1.png" style="display: block; margin: auto;" /> ] --- # Fitting a linear regression model `lm()` is the linear model function in R - It uses a **formula interface** `y ~ x` - The `data =` argument is where `y` and `x` come from ... no need for `iris$x` here ``` r petal_length_width_lm <- lm(Petal.Length ~ Petal.Width, data = iris) ``` <br> -- Printing this object to the console shows only brief information - model coefficients ``` r petal_length_width_lm ``` ``` #> #> Call: #> lm(formula = Petal.Length ~ Petal.Width, data = iris) #> #> Coefficients: #> (Intercept) Petal.Width #> 1.084 2.230 ``` --- # Obtaining a model summary `summary()` on a linear model object prints detailed model information ``` r summary(petal_length_width_lm) ``` .pull-left-66[ ``` #> #> Call: #> lm(formula = Petal.Length ~ Petal.Width, data = iris) #> #> Residuals: #> Min 1Q Median 3Q Max #> -1.33542 -0.30347 -0.02955 0.25776 1.39453 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 1.08356 0.07297 14.85 <2e-16 *** #> Petal.Width 2.22994 0.05140 43.39 <2e-16 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 0.4782 on 148 degrees of freedom #> Multiple R-squared: 0.9271, Adjusted R-squared: 0.9266 #> F-statistic: 1882 on 1 and 148 DF, p-value: < 2.2e-16 ``` ] .pull-right-33[ **Do not worry about these details, other than knowing they are available** **We will explore them later** ] --- # Plotting the regression line from the model We can then link different R tasks together e.g. 
to plot the regression line from the model ``` #> Coefficients: #> (Intercept) Petal.Width #> 1.084 2.230 ``` .pull-left[ Plot the data then add the line using `abline()` ``` r plot(x = iris$Petal.Width, y = iris$Petal.Length, main = "Petal width vs petal length", xlab = "Petal width (cm)", ylab = "Petal length (cm)", pch = 19) abline(reg = petal_length_width_lm, col = 'red', lwd = 2) ``` ] .pull-right[ <img src="data:image/png;base64,#C:/Users/shaun/Documents/r-for-scientific-research/lecture-notes/01-Module-1-R-basics-and-data-import_files/figure-html/petal-width-length-plot-regression-1.png" style="display: block; margin: auto;" /> ] --- class: center, middle, inverse # Your turn! .section-subtitle[ Open the file .inverse[`M01-04-your-first-analysis.R`] and fill in the blanks to do your first full analysis in R. ] --- class: middle # Extending R with additional R packages .footnote[ An R package is an additional **bundle of code, data and documentation** to do a specific task ] --- # Installing new R packages R packages live online in curated **repositories** or in the wild on GitHub and other software repository websites. ### CRAN - **CRAN (Comprehensive R Archive Network)**: This is the primary and **default** repository for R packages. [See the list online.](https://cran.r-project.org/web/packages/available_packages_by_name.html) -- ### Others - **Bioconductor**: Specialized in bioinformatics packages. [See the list online.](https://www.bioconductor.org/packages/release/BiocViews.html#___Software) - **GitHub and other sources**<sup>1</sup>: Contain source code for packages, from which installation is possible. .footnote[ [1] Source code for many of the CRAN and Bioconductor packages lives here, along with many other packages that **have not been through the quality checking** of CRAN and Bioconductor. 
] --- # R packages - CRAN We **install** a package onto the computer using `install.packages()`: ``` r install.packages('tidyverse') ``` .footnote[ **You should have installed `tidyverse` prior to the course!** ] -- </br> But to use it in our **current R session**, we must **load** it using `library()` ``` r library(tidyverse) # Now we can use tidyverse functions ``` -- **Remember**: when you restart R, the package will **no longer** be loaded --- # Installing versus loading a package **Installing** a package with `install.packages()` downloads the package<sup>1</sup> to your computer - You do this once, or when you want to **update** a package - `install.packages()` should **not** be in your scripts <img src="data:image/png;base64,#images/internet-to-computer.png" width="300px" style="display: block; margin: auto;" /> <br> <br> .footnote[ [1] It may actually download **multiple packages** (including those it depends on) ] -- **Loading** a package using `library()` makes the package available to your **current R session** - You need to load packages every time you start R/RStudio - At the beginning of your R scripts you should load the packages required <img src="data:image/png;base64,#images/computer-to-R.png" width="300px" style="display: block; margin: auto;" /> --- # What R packages do you have? The RStudio 'Packages' tab details this <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#images/rstudio-screenshot-packages.png" alt="RStudio packages" width="650px" /> <p class="caption">RStudio packages</p> </div> --- # Do I have this R package? ``` r library(dundermifflin) ``` .red[ ``` #> Error in library(dundermifflin): there is no package called 'dundermifflin' ``` ] </br> -- **Therefore** ``` r install.packages('dundermifflin') ``` ``` #> Installing package into ... #> ... 
#> package ‘dundermifflin’ successfully unpacked and MD5 sums checked ``` -- ``` r library(dundermifflin) get_quote() ``` ``` #> Is this the same grill you grilled your foot on? #> ~ Ryan #> Season 3, Episode 14 - Ben Franklin ``` --- # Exploring the documentation of a package in RStudio .pull-left[ <!-- --> ] -- .pull-right[ <!-- --> ] --- # The tidyverse set of packages <img src="data:image/png;base64,#images/tidyverse.png" width="83%" style="display: block; margin: auto;" /> .footnote[ [https://www.tidyverse.org/](https://www.tidyverse.org/) ] --- # Namespacing **Namespacing** allows us to explicitly use a function from a given package (even without loading the package) - `package_name::function_name` (use the double colon) -- **E.g.** when 2 packages have a function with the same name: `filter()` is a common function name! .pull-left[ ``` r stats::filter ``` ] .pull-right[ ``` r dplyr::filter ``` ] -- **E.g.** ``` r readr::read_csv('../../data/penguins.csv', show_col_types = FALSE) ``` ``` #> # A tibble: 344 × 8 #> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year #> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> #> 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007 #> 2 Adelie Torgersen 39.5 17.4 186 3800 female 2007 #> 3 Adelie Torgersen 40.3 18 195 3250 female 2007 #> 4 Adelie Torgersen NA NA NA NA <NA> 2007 #> 5 Adelie Torgersen 36.7 19.3 193 3450 female 2007 #> 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007 #> # ℹ 338 more rows ``` .footnote[ Using RStudio tab completion with `package_name::` lists all of its functions! ] --- class: center, middle, inverse # R packages exercises .section-subtitle[ Will be conducted for homework (at the end of the module) ] --- class: middle # Import and export of (tabular) data --- # Overview R can import and export many file types ... but this may depend on additional R packages - text files: CSV, TSV - Excel spreadsheets - Files from other statistical programs: SAS, SPSS, Stata, ... 
- Databases - Online: Google Sheets - R's own file types: `.rds` and `.Rdata` <img src="data:image/png;base64,#images/r-import-export.png" width="60%" style="display: block; margin: auto;" /> .footnote[ Figure from https://epirhandbook.com/ ] --- # The import result is typically a data.frame Importing creates a new **data frame** object from a **file** on your computer or from the internet, using a **file path<sup>1</sup>** as input ``` r penguins <- read.csv(file = 'data/penguins.csv') ``` .footnote[ [1] A file path is a location on your computer. File paths are often difficult for beginners to understand; we will examine them in the exercises. ] -- ``` r # View the first 6 rows head(penguins) ``` ``` #> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year #> 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007 #> 2 Adelie Torgersen 39.5 17.4 186 3800 female 2007 #> 3 Adelie Torgersen 40.3 18.0 195 3250 female 2007 #> 4 Adelie Torgersen NA NA NA NA <NA> 2007 #> 5 Adelie Torgersen 36.7 19.3 193 3450 female 2007 #> 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007 ``` --- # Mastering data import ### File paths - Where the file is located on your computer or externally - Where that location is relative to your R working directory ### File types - What type of file are you importing? - What function imports that file type? ### Structure of the raw data - Column names - Missing value coding - Comment lines ### Mastering these will reduce file import problems ... 
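---

# Mastering data import - checking a file path

The file path questions above can be checked from the R console before importing. This is a minimal sketch using base R only. It assumes the example file `data/penguins.csv` sits in a `data` folder inside your working directory; `penguins_path` is just an illustrative object name, so adjust both for your own project.

``` r
# Where is R currently looking? (the working directory)
getwd()

# Build a path relative to the working directory
# (assumed example location, change it to match your project)
penguins_path <- file.path("data", "penguins.csv")

# Check the file is where we think it is before importing
if (file.exists(penguins_path)) {
  # Peek at the first raw lines: column names, delimiter, comment lines
  print(readLines(penguins_path, n = 3))
  penguins <- read.csv(file = penguins_path)
} else {
  message("File not found: ", normalizePath(penguins_path, mustWork = FALSE))
  # List what is in the folder to help spot a typo in the file name
  print(list.files("data"))
}
```

If the file is not found, compare the folder listing with the path you typed. A misspelt file name or an unexpected working directory is a common cause.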
--- # Common data files are **text** data files Comma-separated values (CSV) **text files** are the most common type of data files - Columns are separated (or delimited) by a comma `,` ``` Student ID,Full Name,favourite.food,mealPlan,AGE 1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4 2,Barclay Lynn,French fries,Lunch only,5 3,Jayendra Lyne,N/A,Breakfast and lunch,7 4,Leon Rossini,Anchovies,Lunch only, 5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,unknown 6,Güvenç Attila,Ice cream,Lunch only,6 ``` -- .pull-left[ but other delimiters exist and are **used frequently!** - tab = `\t` . . . `file.tsv`, `file.txt` - pipe = `|` . . . `file.psv`, `file.txt` - semicolon = `;` - space = `\s` ] -- .pull-right[ R can import these natively using the `read.*` functions ``` r read.csv() read.delim() ... ``` but we will use the package **`readr`** ] --- # Importing a CSV file with **`readr`** The **`readr`** package improves importing text files into R - provides useful information about the import of the data - returns a **tibble** data.frame ``` r # Part of the core tidyverse, loaded with library(tidyverse) # or library(readr) ``` -- using `read_csv()` ``` r students <- read_csv(file = "data/students.csv") ``` -- ``` #> Rows: 6 Columns: 5 #> ── Column specification ───────────────────────────────────────────────────────────────────── #> Delimiter: "," #> chr (4): Full Name, favourite.food, mealPlan, AGE #> dbl (1): Student ID #> #> ℹ Use `spec()` to retrieve the full column specification for this data. #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. 
``` --- # a **tibble** data.frame A **tibble** is an improved type of data.frame - functions as a data.frame but has better properties - prints data.frame information to the console - column and row numbers, column types ``` r students ``` ``` #> # A tibble: 6 × 5 #> `Student ID` `Full Name` favourite.food mealPlan AGE #> <dbl> <chr> <chr> <chr> <chr> #> 1 1 "Sunil Huffmann" Strawberry yoghurt Lunch only 4 #> 2 2 "Barclay Lynn" French fries Lunch only 5 #> 3 3 "Jayendra Lyne" N/A Breakfast and lunch 7 #> 4 4 "Leon Rossini" Anchovies Lunch only <NA> #> 5 5 "Chidiegwu Dunkel" Pizza Breakfast and lunch unknown #> 6 6 "G\u00fcven\u00e7 Attila" Ice cream Lunch only 6 ``` .footnote[ https://tibble.tidyverse.org/ ] --- # a **tibble** data.frame A **tibble** is an improved type of data.frame - functions as a data.frame but has better properties - prints data.frame information to the console - column and row numbers, column types ``` r students ``` ``` #> Student ID Full Name favourite.food mealPlan AGE #> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4 #> 2 2 Barclay Lynn French fries Lunch only 5 #> 3 3 Jayendra Lyne N/A Breakfast and lunch 7 #> 4 4 Leon Rossini Anchovies Lunch only <NA> #> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch unknown #> 6 6 Güvenç Attila Ice cream Lunch only 6 ``` *data.frame .footnote[ https://tibble.tidyverse.org/ ] --- # Other text file types You may also encounter other **delimiters** (separators), which **`readr`** (and R) can handle with other functions -- .pull-left[ **tab-separated (TSV)** ``` r read_tsv('files.txt') ``` ``` x y z 1 2 3 4 5 6 ``` ``` #> # A tibble: 2 × 1 #> `x y z` #> <chr> #> 1 1 2 3 #> 2 4 5 6 ``` ] -- .pull-right[ **pipe-separated** or **any other delimiter** ``` r read_delim('files.psv', delim = '|') ``` ``` x|y|z 1|2|3 4|5|6 ``` ``` #> # A tibble: 2 × 3 #> x y z #> <dbl> <dbl> <dbl> #> 1 1 2 3 #> 2 4 5 6 ``` ] --- # Controlling data import There are numerous options when importing data that will help streamline the import 
process when dealing with large or messy datasets - defining missing values - skipping metadata lines - defining column names - specifying column data types - limiting the number of rows imported - and others depending on the type of data being imported -- We will explore a few of these ... -- Remember to **read the function help documentation** to learn what options are available in case you are having import problems --- # Controlling import - missing values Missing values are often **coded differently in different raw data** e.g. `N/A`, `""`, `unknown`, etc - a **blank value** is a default missing value for R - otherwise a text value e.g. `"N/A"` will be imported and not an R missing value `NA` - this can affect the column data type too ... `AGE` is `<chr>` ``` r students <- read_csv(file = "data/students.csv") ``` -- ``` #> # A tibble: 6 × 5 #> `Student ID` `Full Name` favourite.food mealPlan AGE #> <dbl> <chr> <chr> <chr> <chr> #> 1 1 "Sunil Huffmann" Strawberry yoghurt Lunch only 4 #> 2 2 "Barclay Lynn" French fries Lunch only 5 *#> 3 3 "Jayendra Lyne" N/A Breakfast and lunch 7 *#> 4 4 "Leon Rossini" Anchovies Lunch only <NA> *#> 5 5 "Chidiegwu Dunkel" Pizza Breakfast and lunch unknown #> 6 6 "G\u00fcven\u00e7 Attila" Ice cream Lunch only 6 ``` .footnote[ `N/A` = the characters "N/A" and not a missing value. `<NA>` or `NA` = true missing value. ] --- # Controlling import - missing values `na` argument - define which values are treated as missing e.g. 
'N/A', '', 'missing', etc ``` r students <- read_csv(file = "data/students.csv", na = c('', 'N/A', 'unknown')) ``` -- ``` #> # A tibble: 6 × 5 #> `Student ID` `Full Name` favourite.food mealPlan AGE #> <dbl> <chr> <chr> <chr> <dbl> #> 1 1 "Sunil Huffmann" Strawberry yoghurt Lunch only 4 #> 2 2 "Barclay Lynn" French fries Lunch only 5 *#> 3 3 "Jayendra Lyne" <NA> Breakfast and lunch 7 *#> 4 4 "Leon Rossini" Anchovies Lunch only NA *#> 5 5 "Chidiegwu Dunkel" Pizza Breakfast and lunch NA #> 6 6 "G\u00fcven\u00e7 Attila" Ice cream Lunch only 6 ``` - `favourite.food` has correct `<NA>` value - `AGE` is now `<dbl>` (numeric) --- # Controlling import - skipping lines Data files may have metadata or comments, which can be skipped with `skip` or `comment` arguments -- .pull-left[ File ``` This file was produced by ... It contains data for ... x,y,z 1,2,3 4,5,6 ``` ] -- .pull-right[ R - skip first 2 lines ``` r read_csv('files.csv', skip = 2) ``` ``` #> # A tibble: 2 × 3 #> x y z #> <dbl> <dbl> <dbl> #> 1 1 2 3 #> 2 4 5 6 ``` ] -- .pull-left[ File ``` # This file was produced by ... # It contains data for ... x,y,z 1,2,3 4,5,6 ``` ] -- .pull-right[ R - skip lines starting with # ``` r read_csv('file.csv', comment = "#") ``` ``` #> # A tibble: 2 × 3 #> x y z #> <dbl> <dbl> <dbl> #> 1 1 2 3 #> 2 4 5 6 ``` ] --- # Controlling import - column names Sometimes files have no column names (they are given elsewhere in a data dictionary) ... a data.frame **requires column names** and the **first line is expected to be column names**. 
We can use `col_names` to specify none or provide them ourselves .pull-left[ File (without column names) ``` 1,2,3 4,5,6 7,8,9 ``` ] -- .pull-right[ ``` r # first row becomes names read_csv('file.csv') ``` ``` #> # A tibble: 2 × 3 #> `1` `2` `3` #> <dbl> <dbl> <dbl> #> 1 4 5 6 #> 2 7 8 9 ``` ] -- .pull-left-40[ ``` r read_csv('file.csv', col_names = FALSE) ``` ``` #> # A tibble: 3 × 3 #> X1 X2 X3 #> <dbl> <dbl> <dbl> #> 1 1 2 3 #> 2 4 5 6 #> 3 7 8 9 ``` ] -- .pull-right-50[ ``` r read_csv('file.csv', col_names = c("x", "y", "z")) ``` ``` #> # A tibble: 3 × 3 #> x y z #> <dbl> <dbl> <dbl> #> 1 1 2 3 #> 2 4 5 6 #> 3 7 8 9 ``` ] --- class: middle # Importing Excel data - choosing sheets - controlling import (missing values, skipping metadata lines, importing cell ranges) --- # Importing Excel sheets with **`readxl`** The **`readxl`** package makes it easy to get data out of Excel and into R. ``` r library(readxl) ``` -- ``` r # note the sheet parameter read_excel(path = 'data/penguins.xlsx', sheet = 'Adelie') ``` -- ``` #> # A tibble: 152 × 8 #> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year #> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> #> 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007 #> 2 Adelie Torgersen 39.5 17.4 186 3800 female 2007 #> 3 Adelie Torgersen 40.3 18 195 3250 female 2007 #> 4 Adelie Torgersen NA NA NA NA <NA> 2007 #> 5 Adelie Torgersen 36.7 19.3 193 3450 female 2007 #> 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007 #> 7 Adelie Torgersen 38.9 17.8 181 3625 female 2007 #> 8 Adelie Torgersen 39.2 19.6 195 4675 male 2007 #> # ℹ 144 more rows ``` .footnote[ https://readxl.tidyverse.org/ ] --- # Listing sheets in an Excel file This is useful to help with identifying sheet names to import ``` r excel_sheets(path = 'data/penguins.xlsx') ``` ``` #> [1] "Adelie" "Gentoo" "Chinstrap" ``` -- <br> ``` r read_excel(path = 'data/penguins.xlsx', sheet = 'Chinstrap') ``` ``` #> # A tibble: 68 × 8 #> species island bill_length_mm 
bill_depth_mm flipper_length_mm body_mass_g sex year #> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> #> 1 Chinstrap Dream 46.5 17.9 192 3500 female 2007 #> 2 Chinstrap Dream 50 19.5 196 3900 male 2007 #> 3 Chinstrap Dream 51.3 19.2 193 3650 male 2007 #> 4 Chinstrap Dream 45.4 18.7 188 3525 female 2007 #> 5 Chinstrap Dream 52.7 19.8 197 3725 male 2007 #> 6 Chinstrap Dream 45.2 17.8 198 3950 female 2007 #> 7 Chinstrap Dream 46.1 18.2 178 3250 female 2007 #> 8 Chinstrap Dream 51.3 18.2 197 3750 male 2007 #> # ℹ 60 more rows ``` --- # Controlling import - missing values, skipping rows, column names As with the **`readr`** functions ``` r # Define missing value characters read_excel(..., na = c('Missing', '', 'N/A')) ``` ``` r # Skip first 2 rows read_excel(..., skip = 2) ``` ``` r # No column names read_excel(..., col_names = FALSE) ``` --- # Controlling import - cell ranges We can define Excel-style cell ranges with the `range` argument ``` r read_excel(path = '../../data/penguins.xlsx', sheet = 'Gentoo', * range = "A1:D4") ``` ``` #> # A tibble: 3 × 4 #> species island bill_length_mm bill_depth_mm #> <chr> <chr> <dbl> <dbl> #> 1 Gentoo Biscoe 46.1 13.2 #> 2 Gentoo Biscoe 50 16.3 #> 3 Gentoo Biscoe 48.7 14.1 ``` <img src="data:image/png;base64,#images/readxl-cell-range.png" width="70%" style="display: block; margin: auto;" /> --- class: middle # Other data types --- # Other data types From R - `.rds` objects saved from R, e.g. a data.frame, vector, list - keeps any formatting previously applied in R ``` r fish <- readr::read_rds("data/fish-lengths.rds") ``` -- From statistical programs using **`haven`** from the tidyverse ``` r haven::read_sas() haven::read_spss() ``` -- Google Sheets using **`googlesheets4`** ``` r googlesheets4::read_sheet("https://docs.google.com/spreadsheets/d/1U6Cf_qEOhiR9A...") ``` And from many others e.g. databases, spatial data, genomic data, ... 
.footnote[ https://haven.tidyverse.org/ https://googlesheets4.tidyverse.org/ ] -- .text-xl[ Unsure which package to use? Google .text-075[`how to import {file type} in R?`] ] --- # Exporting data We typically export data from R after data cleaning and transformation. - Data is most often exported into text data files (`.csv`, etc) or R data files (`.rds`). - Exporting to Excel is done less often and requires other R packages (not explored here). - Any data type formatting applied in R will be **lost with text data files** but **retained with R data files**. The choice depends on downstream use of the data. - **Read function documentation** to know your export options -- ``` r # Export CSV write_csv(x = my_dataframe, file = 'path/to/file.csv') # Export tab delimited, NA values will be blank write_delim(x = my_dataframe, file = 'path/to/file.tsv', delim = "\t", na = "") # Export an R data file write_rds(x = my_dataframe, file = 'path/to/file.rds') ``` --- class: middle, inverse # Let's import data into R ... but first ... --- class: middle # A note on data organisation in spreadsheets --- # Spreadsheets are often organised for humans, not computers This creates immense difficulties for data import! 
<div class="figure" style="text-align: center"> <img src="data:image/png;base64,#images/messy-spreadsheet.png" alt="Spreadsheet organised for humans, not computers" width="65%" /> <p class="caption">Spreadsheet organised for humans, not computers</p> </div> --- # Better data organization in spreadsheets<sup>1</sup> When collecting data, follow these basic principles: - organize the data as a single rectangle - observations as rows, variables as columns, and a single header row - do not include calculations in the raw data files - do not use font color or highlighting as data - put just one thing in a cell - do not leave any cells empty - be consistent - write dates like YYYY-MM-DD - create a data dictionary - choose good names for things - make backups - use data validation to avoid data entry errors - save the data in plain text files .footnote[ [1] Broman, K. W., & Woo, K. H. (2018). Data Organization in Spreadsheets. The American Statistician, 72(1), 2–10. https://doi.org/10.1080/00031305.2017.1375989 ] --- # Better data organization in spreadsheets<sup>1</sup> Prepare your data in a **tidy** format **before** using R <img src="data:image/png;base64,#images/untidy-tidy-data.png" width="90%" /> .footnote[ [1] Broman, K. W., & Woo, K. H. (2018). Data Organization in Spreadsheets. The American Statistician, 72(1), 2–10. https://doi.org/10.1080/00031305.2017.1375989 ] --- class: center, middle, inverse # Final exercises for module 1 .section-subtitle[ Complete the exercises in each of the two files .inverse[`M01-05-R-packages.R`] .inverse[`M01-06-data-import-export.R`] ] --- class: middle, center # End of module 1